Abstract

Many people are familiar with the physico-chemical properties of gene 
sequences. In this paper I present a mathematical perspective: how do 
mathematical principles such as information theory, coding theory, and 
combinatorics influence the beginnings of life and the formation of the 
genetic codes we observe today? What constraints on possible life forms 
are imposed by information-theoretical concepts? Further, I detail how 
mathematical principles can help us to analyse the genetic sequences we 
observe in the world today. 

1 Introduction 

Genetics is concerned with the physical characteristics of organisms that are
passed on from one organism to another through the use of deoxyribonucleic
acid (DNA), consisting of a sequence of nucleotides. The nucleotides carry the
chemical bases adenine, thymine, cytosine and guanine that are denoted using
the alphabet {A, T, C, G}. Those on one strand are paired in a complementary
fashion with those on the other strand, where adenine matches with thymine,
and guanine with cytosine. Groups of three bases are called codons, and these
encode the twenty amino acids that combine to form proteins, the building
blocks of life. In a nutshell, the central dogma of molecular biology states that
"DNA makes RNA makes protein". This is encapsulated in Figure 1. The
DNA is transcribed into complementary messenger ribonucleic acid (mRNA).
In RNAs, the alphabet is {A, U, C, G}, where uracil plays the same role that
thymine does in DNA, as it pairs with adenine. Sections of the mRNA that do
not code for proteins are removed, and a "poly-A tail" (a sequence composed
entirely of adenine bases) is added to (chemically) stabilise the sequence. The
mRNA then acts as a template for protein synthesis. Transfer RNAs (tRNAs)
bind to an amino acid on one end, and to a complementary set of three bases on
the mRNA template on the other. A 1D sequence of amino acids forms, is then
detached from the tRNAs, and folds into a 3D structure. This sometimes occurs
by itself and sometimes with the aid of other proteins, either immediately or at
a later date in the life of the cell.
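
The pairing and transcription rules described above can be sketched in a few
lines of Python. This is only an illustration under simplifying assumptions:
strand direction is ignored, no splicing or poly-A tail is modelled, and the
function names and the example template strand are hypothetical.

```python
# Simplified sketch: strand direction (5'/3') is ignored, and splicing and the
# poly-A tail are not modelled. All names and the example are illustrative.
DNA_PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}
DNA_TO_MRNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def complementary_strand(strand):
    """Return the DNA strand that pairs base-by-base with `strand`."""
    return "".join(DNA_PAIRING[base] for base in strand)


def transcribe(template_strand):
    """Return the mRNA complementary to a DNA template strand (U replaces T)."""
    return "".join(DNA_TO_MRNA[base] for base in template_strand)


def split_into_codons(mrna):
    """Split an mRNA string into triplets, dropping any incomplete final codon."""
    usable_length = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable_length, 3)]


template = "TACAAACCG"
mrna = transcribe(template)
print(complementary_strand(template))  # ATGTTTGGC
print(mrna)                            # AUGUUUGGC
print(split_into_codons(mrna))         # ['AUG', 'UUU', 'GGC']
```

Each codon produced this way would then be looked up in a codon table to give
the corresponding amino acid.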

There are several key areas in which mathematical principles underlie, influ- 
ence, and can provide information about genetic structures. The key questions 
that these principles can help answer are 

• Why do we have four bases, a triplet coding and twenty amino acids? 



• Why do we observe the particular assignment of triplets to amino acids 
that we do? 

• How do new gene sequences arise, and how do they spread in a population? 

• How can we analyse the sequences that arise? 

Some mathematically-based answers are discussed in the remainder of this pa- 
per. 



[Diagram: DNA is transcribed to mRNA, which is translated to protein.]




Figure 1: The central dogma of molecular biology states that "DNA is
transcribed into messenger RNA, which is then translated into protein." This
diagram also shows DNA replication, which is done with the aid of a number of
proteins. At the mRNA stage, introns are spliced out from the sequence, leaving
only the protein-coding exons. This dogma is of course vastly simplified; for
example, there is added complexity through splicing, RNA-only genes, RNA-RNA
interactions, prions, and other details ([1]; [2]). But in its essential form
this does describe the flow of information in a cell.



2 Number of bases and amino acids 

The following is a summary of the work of Soto and Toha, who took a
mathematical approach to the question of why four bases, a triplet coding,
and 20 amino acids are used, based on the assumption that nature will, over
evolutionary time, find a solution to the problem that minimises the amount of
cell machinery. It also assumes that the machinery is not unlike that used by
computer memory chips for decoding. This is not a bad assumption, but it
leaves out chemical tricks that the tRNAs can use. I also use the fact that
optimal solutions, since they have an advantage in evolutionary terms, spread
in a population, as I explore later. The main argument of Soto and Toha is as
follows:

Firstly, they define the maximum number of amino acids as

$$N = p^{n}, \tag{1}$$

where p is the number of possible bases (symbols of length 1) and n the number
of positions. For example, the amino acid coding used in all living things has
p = 4 bases and n = 3 positions, a triplet code. This gives a maximum of N = 64
possible amino acids. For the assumptions above, it turns out the amount of
"hardware", or cell machinery, is proportional to the number of bases times the
number of positions, which can be written as

$$\text{hardware} \propto p \times n, \tag{2}$$
where p and n are as defined above. It also turns out that the minimum amount
of hardware needed for a given number of amino acids can be written as

$$A = e \times x, \tag{3}$$

where e = 2.718... is the base of the natural logarithm, which describes many
growth and decay processes that occur in the natural world, and x is the
(ideal) number of positions. To see this, note that Eq. 1 gives n = ln N / ln p,
so the hardware p × n = (p / ln p) ln N is smallest when p = e. So we need to
have the number of bases close to e, thus optimising the number of positions
for a given N by setting

$$x = \ln N. \tag{4}$$

Then we can find a semi-optimal N by

$$N = \left(N^{1/y}\right)^{y}, \tag{5}$$

where y is the actual number of positions used, resulting in a degeneracy

$$\text{degeneracy} = \left(\frac{p}{b}\right)^{y}, \tag{6}$$

where p is the number of actual bases used and b is the minimum (integer)
possible. Then the actual amount of hardware used is

$$B = b \times y = y N^{1/y}, \tag{7}$$

and we write the difference between this amount of hardware and the optimal,
A = e × x, as

$$\Delta = B - A > 0, \tag{8}$$

where Δ is the difference in "hardware" between the actual and optimal solution,
and this is always greater than zero as we can approach but never achieve the
minimal amount of "hardware" (since this would require a non-integral number
of bases). If we set the derivative, or rate of change of Δ,

$$\frac{d\Delta}{dN} = N^{1/y - 1} - \frac{e}{N}, \tag{9}$$

to zero, this allows us to find the optimal solution for the number of amino
acids for a fixed number of base positions, namely N = e^y. A graph of Δ is
shown in Figure 2, showing the minima for one, two and three positions
occurring at three, seven, and 20 amino acids. This assumes four bases are
used. The actual minima, for the best possible choice of the number of bases,
are shown in Table 1, again indicating that 20 amino acids is the optimal
number.
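
As a rough numerical check of this argument, the sketch below evaluates the
gap between the actual hardware y N^(1/y) and the optimal hardware e ln N, and
searches for the integer N that minimises it for each number of positions. The
function names and the search range are illustrative assumptions, not part of
Soto and Toha's work.

```python
import math


def actual_hardware(n_amino_acids, positions):
    """Hardware y * N**(1/y) for y positions coding N amino acids."""
    return positions * n_amino_acids ** (1.0 / positions)


def optimal_hardware(n_amino_acids):
    """Lower bound e * ln(N), reached only with a non-integer e 'bases'."""
    return math.e * math.log(n_amino_acids)


def hardware_gap(n_amino_acids, positions):
    """Difference between the actual and the optimal hardware."""
    return actual_hardware(n_amino_acids, positions) - optimal_hardware(n_amino_acids)


def best_integer_n(positions, search_up_to=200):
    """Integer N in [2, search_up_to] with the smallest hardware gap."""
    return min(range(2, search_up_to + 1),
               key=lambda n: hardware_gap(n, positions))


for y in range(1, 5):
    n = best_integer_n(y)
    print(y, round(math.exp(y), 3), n, hardware_gap(n, y))
```

For one to four positions the search returns 3, 7, 20 and 55 amino acids, the
integers closest to e^y.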

Having established that 20 amino acids is best, we can then turn to the
problem of why four bases are used. It is shown above that the best we can do
is to get as close to e bases as possible, by choosing three bases. Two
bases would require more positions (and more hardware) than three or four.
There are two main reasons why four bases is actually a better choice than
three, however:

1. Four bases allow a complementary pairing, for accurate, fast and efficient
replication of genetic material.

2. On the hypothesis that there was a precursor genetic code, using fewer
positions and coding for fewer than 20 amino acids, this evolutionary
pathway is actually easier (in terms of being more efficient in hardware) if
four bases are used. Table 2 shows the corresponding lower values of Δ (Δ is
normally lower for three bases, but not always, as this table shows).
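
The comparison between three and four bases can be checked numerically. The
sketch below assumes that, for a given number of amino acids N and number of
bases p, the code uses the fewest positions y with p^y >= N, and that the
hardware gap is y N^(1/y) - e ln N. The function names are hypothetical.

```python
import math


def min_positions(n_amino_acids, n_bases):
    """Smallest number of positions y such that n_bases**y >= n_amino_acids."""
    y = 1
    while n_bases ** y < n_amino_acids:
        y += 1
    return y


def gap_for_bases(n_amino_acids, n_bases):
    """Hardware gap y * N**(1/y) - e * ln(N), using the fewest positions."""
    y = min_positions(n_amino_acids, n_bases)
    return y * n_amino_acids ** (1.0 / y) - math.e * math.log(n_amino_acids)


for n in (4, 10, 11):
    print(n, round(gap_for_bases(n, 3), 4), round(gap_for_bases(n, 4), 4))
```

For 4, 10 and 11 amino acids the gap with four bases is never larger than the
gap with three bases, because four bases need one position fewer.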
Figure 2: Plot of the distance between the actual hardware and the optimal
hardware, on the vertical axis, for different numbers of amino acids.



Table 1: This table shows the optimal number of amino acids for 1-4 base
positions, and the corresponding (minimal) difference Δ between the hardware
of the actual coding and the theoretical minimum A = e × x.

y (number of positions)   N (number of amino acids)   rounded N   Δ
1                         2.718                       3           0.0136
2                         7.389                       7           0.0019
3                         20.086                      20          8.2 x 10^-6
4                         54.598                      55          18 x 10^-6



Table 2: The following values of Δ, no higher for four bases than for three,
indicate the pathway to 20 amino acids and four bases.

N (number of amino acids)   Δ for 3 bases   Δ for 4 bases
4                           0.2317          0.2317
10                          0.2042          0.0655
11                          0.1538          0.1151
3 Gray coding



Mitochondria are organelles that live inside each cell, and provide the cell with
energy. They have their own genetic code, independent of the normal set of 23
pairs of chromosomes that reside in the (human) cell nucleus. They also use a
different way of coding for the amino acids. The mitochondrial code for
vertebrates is shown in Table 3. This differs slightly from the normal code
used in nuclear DNA in that the "wobble rule" is exact. This means that, for
some particular choices of the first two bases, if we change the third base, we
end up with the same amino acid in the code. This is important for reducing the
cell machinery needed for decoding, as discussed above. But since mutations do
occur, we would also like a single mutation in a triplet (which is a number of
times more likely than a double mutation in a triplet) to result in either no
change in the amino acid, or a change to a very chemically similar amino acid.

Table 3: Table of the RNA triplets and the corresponding (abbreviated) amino
acids. The "wobble rule" can be seen in that when the third base changes, we
usually stay on the same amino acid, or in some cases move to a chemically
related one.

UUU Phe   UCU Ser   UAU Tyr    UGU Cys
UUC Phe   UCC Ser   UAC Tyr    UGC Cys
UUA Leu   UCA Ser   UAA Stop   UGA Trp
UUG Leu   UCG Ser   UAG Stop   UGG Trp
CUU Leu   CCU Pro   CAU His    CGU Arg
CUC Leu   CCC Pro   CAC His    CGC Arg
CUA Leu   CCA Pro   CAA Gln    CGA Arg
CUG Leu   CCG Pro   CAG Gln    CGG Arg
AUU Ile   ACU Thr   AAU Asn    AGU Ser
AUC Ile   ACC Thr   AAC Asn    AGC Ser
AUA Met   ACA Thr   AAA Lys    AGA Stop
AUG Met   ACG Thr   AAG Lys    AGG Stop
GUU Val   GCU Ala   GAU Asp    GGU Gly
GUC Val   GCC Ala   GAC Asp    GGC Gly
GUA Val   GCA Ala   GAA Glu    GGA Gly
GUG Val   GCG Ala   GAG Glu    GGG Gly
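
The "wobble rule" can be checked directly against the vertebrate mitochondrial
code. The sketch below encodes the table as a string of one-letter amino acid
symbols and counts how often a single-base change leaves the amino acid
unchanged. The helper names are illustrative assumptions, and the "exactness"
tested here is one reading of the rule: swapping U with C, or A with G, in the
third position never changes the amino acid.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acids for the 64 codons in UCAG order, transcribed from the
# vertebrate mitochondrial code table ('*' marks a stop codon).
AMINO_ACIDS = ("FFLLSSSSYY**CCWW" "LLLLPPPPHHQQRRRR"
               "IIMMTTTTNNKKSS**" "VVVVAAAADDEEGGGG")
CODE = {"".join(codon): aa
        for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Third-base swaps that the exact wobble rule treats as equivalent.
SWAP = {"U": "C", "C": "U", "A": "G", "G": "A"}


def wobble_is_exact(code):
    """True if swapping U<->C or A<->G in position 3 never changes the result."""
    return all(code[c] == code[c[:2] + SWAP[c[2]]] for c in code)


def synonymous_fraction(code, position):
    """Fraction of single-base changes at `position` (0, 1 or 2) that keep
    the same amino acid (or keep a stop codon a stop codon)."""
    same = total = 0
    for codon in code:
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += code[mutant] == code[codon]
    return same / total


print(wobble_is_exact(CODE))  # True
for position in range(3):
    print(position + 1, synonymous_fraction(CODE, position))
```

Two thirds of all third-base changes are silent in this code, whereas changes
in the first or second base almost always alter the amino acid.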



We then have the problem of assigning the 20 amino acids (including
methionine, whose codon doubles as a start codon) plus a stop
codon to the set of 64 codons, such that a change by one base in the codon
results in minimal change in the amino acid, even in positions other than the
"wobble rule" (3rd base) position. This problem has been shown ([6]) to be
equivalent to the travelling salesman problem; that is, solving one problem
gives you a solution to the other problem. The travelling salesman problem is a
very famous problem in the areas of discrete combinatorics (how to solve
problems of arrangement of items) and in the theory of computational
complexity, where we are interested in how long it takes to solve a difficult
problem. The travelling salesman problem is how to visit a group of cities,
visiting each city once, and going back to the starting city, all for minimum
cost. It also turns out that this is an equivalent problem to the Towers of
Hanoi game, in which one attempts to shift a set of discs stacked from largest
up to smallest from one tower onto one of two other towers, shifting only one
disc at a time, and making sure that a disc is never placed above a disc of
smaller size. Thus, it might be said, to paraphrase and extend a quote by Niels
Bohr, that "if God exists, then not only does he play dice with the Universe,
but he also plays Towers of Hanoi with the living creatures within it." It
should be noted that there are other mathematical, biological, and mixed
mathematical/biological reasons why the existing codes (both standard and the
various mitochondrial and other codes) are optimal, and also why there are
differences between the various codes.
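
The section title refers to Gray coding, in which successive code words differ
in exactly one symbol. As an illustration only (the text above does not spell
out the construction), the sketch below uses the standard binary reflected Gray
code and checks a well-known correspondence: the bit that flips at each step is
the disc that moves at the same step of the shortest Towers of Hanoi solution.
All function names are hypothetical.

```python
def gray_code(k):
    """k-th word of the binary reflected Gray code, as an integer."""
    return k ^ (k >> 1)


def flipped_bit(step):
    """1-based index of the bit that changes between Gray words step-1 and step."""
    return (gray_code(step) ^ gray_code(step - 1)).bit_length()


def hanoi_moves(n_discs, source="A", target="C", spare="B"):
    """Shortest solution as a list of (disc, from_tower, to_tower) moves,
    with disc 1 the smallest."""
    if n_discs == 0:
        return []
    return (hanoi_moves(n_discs - 1, source, spare, target)
            + [(n_discs, source, target)]
            + hanoi_moves(n_discs - 1, spare, target, source))


n = 3
moves = hanoi_moves(n)
print([disc for disc, _, _ in moves])                 # [1, 2, 1, 3, 1, 2, 1]
print([flipped_bit(k) for k in range(1, 2 ** n)])     # [1, 2, 1, 3, 1, 2, 1]
```

A code with this single-change property is what one would want for codon
assignments, so that the most likely mutations move to a neighbouring entry.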



4 Game theory and cheating husbands 

To answer the question of how genetic structures arise, I considered a gene
"for" infidelity. When I (and others, like Dawkins) talk of a gene being "for"
something, I am saying that, all other things (including other genes) being
equal, this gene influences behaviour through different types (or chemical
concentrations) of a protein. It is also important to remember that along with
genes, cultural and family background also play a large role in determining
behaviour. The culture also sets up societal systems for taking care of
children. Women face a number of tradeoffs in selecting mates, both for the
long term and the short term, and these are highly dependent on culture,
although it should be noted that some of these are also faced by females across
the animal kingdom. Some example scores (or relative advantages or
disadvantages) for various male and female strategies are shown in Table 4.
This assumes a society in which women are the predominant child raisers. If a
woman can stay in a monogamous relationship, or "get away" with cheating and
still have a husband around, then this is "better" (1 − (−2) > 0) than if she
doesn't have a husband around. On the other hand, a male benefits his genes
more and more, the more he is unfaithful, since the women will be raising his
children. So it is not surprising that we need a mixed strategy (which doesn't
have to be used exclusively for a whole lifetime, although I simplified my
simulation by doing that) in which there are some (women) who remain in
semi-monogamous relationships, but there is a population of (mostly) men who
cheat a lot. This is shown in Figure 3, which I generated by performing a
computer simulation capturing all of the above details. This whole model
ignores emotions, but then in evolutionary terms, the emotions don't matter
much here, since they mainly occur after children have already been raised.
Thus we expect nature to not really care either way if people get hurt.

Of course, this is not the only possible system where game theory helps us
determine evolutionarily stable solutions to the problems that organisms face;
this has also been shown for the threats from disease (both genetic and
infectious disease). We then find that these good solutions spread through a
population: even if they confer only a slight advantage, over time they will
spread.
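
How a slight advantage spreads can be illustrated with a textbook one-locus
model, which is an assumption made here for illustration and is not the
simulation used for Figure 3. Carriers of a variant are taken to reproduce
(1 + s) times as fast as non-carriers, so its frequency p becomes
p(1 + s) / (1 + p s) in the next generation. The function names and the
example numbers are hypothetical.

```python
def next_frequency(p, s):
    """Frequency of the variant after one generation of selection."""
    return p * (1 + s) / (1 + p * s)


def generations_to_reach(p0, s, target):
    """Number of generations for the frequency to rise from p0 to target.

    Requires a positive advantage s, otherwise the target is never reached.
    """
    if s <= 0 or not 0 < p0 < 1 or not 0 < target < 1:
        raise ValueError("need s > 0 and frequencies strictly between 0 and 1")
    p, generations = p0, 0
    while p < target:
        p = next_frequency(p, s)
        generations += 1
    return generations


# A 1% advantage takes a variant from 1% to 99% of the population
# in under a thousand generations.
print(generations_to_reach(0.01, 0.01, 0.99))
```

Even a very small advantage therefore leads to near fixation on evolutionary
time scales, although a smaller s takes proportionally longer.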
Table 4: This table details the scores, or reproductive benefits, to males and
females who adopt various strategies: (1) playing around (mainly monogamous
but with some cheating), (2) strictly monogamous, and (3) always playing
around (never monogamous). We assume a society in which women are the
predominant child rearers (found in some cultures but not others). Scores are
only relative in determining overall success; for example, it wouldn't matter
if we put -3 instead of -2: it is still lower than the other scores, and so we
expect that the -3 strategy will lose.



Gender   Fidelity         Score
Male     Monogamous
Male     Plays around     1/2
Male     Serial cheater   1
Female   Monogamous       1
Female   Plays around     1
Female   Serial cheater   -2



[Plot for Figure 3; horizontal axis: generations.]



Figure 3: Graph showing the population of people (1) playing around — mainly 
monogamous but with some cheating, (2) strictly monogamous, and (3) always 
playing around — never monogamous. Random fluctuations are present, due to 
the random aspects of the simulation. In general we obtain a population mix of 
people who are never monogamous or monogamous plus cheating. We expect 
this because there is an advantage to men to cheat, and neutral if women cheat. 
We don't expect a pure population of total cheaters because this would be 
penalised in general through the -2.0 score for women. 






5 Entropy and introns 



In this section I will first introduce the topic of entropy, and then discuss how 
it applies to the introns, the parts of genes that are cut out of the transcribed 
mRNA sequence template before the protein is made. Entropy is also discussed 
later on, as it can also be used to analyse the mathematical properties of existing 
sequences. 

5.1 Entropy 

Entropy is a measure of the amount of order or disorder in a sequence, which
can be thought of as the information (ignoring context). The mathematical
formula is

$$H(X) = -\sum_{x \in X} p_x \log_2 p_x, \tag{10}$$

where x denotes different symbols from the set of symbols in a sequence, X,
and p_x is the probability of finding a symbol, or simply the number of
times it occurs divided by the total number of symbols in the sequence. For
example, the sequence X = aact has p_a = 2/4, p_c = p_t = 1/4, and thus
has entropy −(2/4 log_2(2/4) + 2 × 1/4 log_2(1/4)) = 1.5 bits (the same bits
that computers use) of information. A related topic to the Shannon entropy
is Chaitin-Kolmogorov entropy. This is the "algorithmic" entropy, which is
defined in terms of the shortest computer program that could reproduce a given
sequence. This is related to the Shannon entropy (ideally it should approach,
or get close to, the measure of the Shannon entropy). We can consider the
Chaitin-Kolmogorov entropy as being like a self-extracting ZIP (computer) file:
the data is compressed, and a short program is attached which can then
decompress the compressed data when the self-extracting file is run. I show
below that this is similar to what occurs in introns.
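
Both ideas are easy to try out. The sketch below computes the Shannon entropy
of the example sequence and, as a crude stand-in for the algorithmic entropy
(which cannot be computed exactly), compares how well a general-purpose
compressor shrinks a repetitive and a scrambled sequence. The function names
and the test sequences are illustrative assumptions.

```python
import math
import random
import zlib
from collections import Counter


def shannon_entropy(sequence):
    """Shannon entropy in bits per symbol, from observed symbol frequencies."""
    total = len(sequence)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(sequence).values())


def compressed_size(sequence):
    """Bytes needed by zlib: only a rough upper bound on algorithmic entropy."""
    return len(zlib.compress(sequence.encode("ascii"), 9))


print(shannon_entropy("aact"))  # 1.5

repetitive = "acgt" * 250
symbols = list(repetitive)
random.Random(1).shuffle(symbols)   # same base counts, order destroyed
scrambled = "".join(symbols)

# Both sequences have the same Shannon entropy (2 bits per symbol), but the
# repetitive one has a much shorter description.
print(shannon_entropy(repetitive), shannon_entropy(scrambled))
print(compressed_size(repetitive), compressed_size(scrambled))
```

The comparison shows why the two measures are related but not identical: the
Shannon entropy used here only sees symbol frequencies, while a compressor can
also exploit the order of the symbols.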

5.2 Introns 

Entropy can enlighten us on two key things: evolutionary advantages for introns, 
and also on patterns found in specific existing genes. The former is discussed 
here, and the latter is discussed in the following section. 

If we consider each protein as composed of distinct functional modules
(true for many, but not all proteins) then we often find other proteins containing
the same modules. If we can write these alternative proteins as a single gene,
with alternative splices, then we can increase the Shannon entropy, since there
is less redundancy (and thus the probabilities of finding various bases are more
even). This also increases the Chaitin-Kolmogorov entropy, provided we use this
alternative splicing a lot in comparison to the extra genes we need to encode
the alternative splicing machinery (an "algorithm" to unpack the alternative
splices from a single gene). In general, if the entropy of a system increases, the
complexity increases (not always true, since a truly random signal has a very low
complexity), and this leads to increased adaptability (but trades off reliability).

The need to have minimal machinery here again guides us as to the
evolutionary solution found. If we have some systematic way of marking where
these modules, or exons, start and stop in genes, then we can use the same
set of cellular machinery repeatedly. This then allows a greater degree of
freedom in terms of the instructions that can be coded for, since we can include
non-(protein-)coding instructions in these introns. As a very simple example
of this, it has been shown that increasing the intron length can decrease the
probability that the final protein contains the exon immediately after that
intron (or, in other words, the final amount of protein containing it).



6 Mathematical analysis of genetic structures 

Mathematics not only underpins genetic structures but it can also be used to
analyse genetic structures in existing organisms. The following is an excerpt
from my paper on using mutual information to analyse DNA sequences. Mutual
information is like the Shannon entropy above, except that it is defined for
two sequences. Basically it describes the information that two sequences, say X
and Y, have in common: the sum of their individual entropies minus their joint
entropy, so that the shared information is not double counted. The
mathematical formula is

$$I(X;Y) = H(X) + H(Y) - H(X,Y), \tag{11}$$

where H(.) is the Shannon entropy defined above in Eq. 10.
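
A minimal sketch of this formula, assuming two aligned sequences of equal
length whose symbols are compared position by position (the function names are
illustrative):

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy in bits of a sequence of hashable symbols."""
    symbols = list(symbols)
    total = len(symbols)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(symbols).values())


def mutual_information(x, y):
    """H(X) + H(Y) - H(X, Y) for two aligned sequences of equal length."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return entropy(x) + entropy(y) - entropy(zip(x, y))


print(mutual_information("acgt", "acgt"))  # 2.0: each symbol determines the other
print(mutual_information("aacc", "acac"))  # 0.0: the sequences share nothing
```

Identical sequences share all of their information, while sequences whose
symbols occur in every combination equally often share none.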

A mathematical method for showing the existence of long-range correlations in
DNA is to use the mutual information function, as given in Eq. 12 below. This
approach has been shown to distinguish between coding and non-coding
regions ([12]). We explore the use of the mutual information function,

$$M(d) = \sum_{\alpha \in A} \sum_{\beta \in A} P_{\alpha\beta}(d) \log_2 \frac{P_{\alpha\beta}(d)}{P_\alpha P_\beta}, \tag{12}$$

for symbols α, β ∈ A (in the case of DNA, A = {a, t, c, g}). P_{αβ}(d) is the
probability that symbols α and β are found a distance d apart, and P_α is the
probability of finding symbol α. This is related to the correlation function
in Eq. 13:

$$\Gamma(d) = \sum_{\alpha \in A} \sum_{\beta \in A} a_\alpha a_\beta P_{\alpha\beta}(d) - \left( \sum_{\alpha \in A} a_\alpha P_\alpha \right)^{2}, \tag{13}$$

where a_α and a_β are numerical representations of symbols α and β. As discussed
by Li ([13]), the fact that we are working with a finite sequence means that this
M(d) overestimates the true M_T(d) by

$$M(d) - M_T(d) \approx \frac{K(K-2)}{2N}, \tag{14}$$

where K is the number of symbols (for DNA this is always 4) and N is the
sequence length.

An example of applying this method to a real sequence of (mouse) DNA is
shown in Figure 4, clearly showing the existence of long-range correlations. It
is not altogether clear why these correlations exist across proteins; it may be
due to variants of functional modules, strung together to make a protein, or
it may be due to interesting structures in introns.
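
The mutual information function can be estimated directly from a sequence by
counting pairs of symbols a distance d apart, and compared with a shuffled copy
in which correlations have been destroyed. The sketch below is an illustration
under stated assumptions, not the code used for Figure 4: it uses a synthetic
periodic sequence rather than real DNA, estimates the single-symbol
probabilities from the same pairs, and all names are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information_function(sequence, d):
    """Estimate M(d) in bits from all symbol pairs a distance d apart."""
    n_pairs = len(sequence) - d
    if d < 1 or n_pairs < 1:
        raise ValueError("d must satisfy 1 <= d < len(sequence)")
    first, second = sequence[:n_pairs], sequence[d:]
    pair_counts = Counter(zip(first, second))
    first_counts, second_counts = Counter(first), Counter(second)
    total = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        p_a = first_counts[a] / n_pairs
        p_b = second_counts[b] / n_pairs
        total += p_ab * math.log2(p_ab / (p_a * p_b))
    return total


def shuffled(sequence, seed=0):
    """Copy of the sequence with the same base counts but random order."""
    symbols = list(sequence)
    random.Random(seed).shuffle(symbols)
    return "".join(symbols)


periodic = "acgt" * 50            # synthetic stand-in for a correlated sequence
print(mutual_information_function(periodic, 4))            # 2.0
print(mutual_information_function(shuffled(periodic), 4))  # small "noise floor"
```

Repeating the shuffle many times with different seeds gives a band of values
against which the original sequence can be compared.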
Figure 4: This figure shows the plot of the mutual information function M(d)
in Eq. 12 against base distance d for the sequence of the MAP kinase-activated
protein kinase 2 gene from Mus musculus (in plain English: a mouse protein),
shown in a darker line style, compared with the set of 100 randomized sequences
of the same base distribution, the lighter band. The graph of mutual information
in the MAP kinase gene mostly sits above the "noise floor" of the randomized
sequences, in which the correlations have been destroyed.
7 Conclusions



Mathematics presents us with powerful tools, such as entropy and game theory,
that enlighten us as to what sort of genetic structures exist, how they evolve,
and how we can analyse them. In particular, I have shown mathematical
arguments for:

• why four bases, a triplet code, and 20 amino acids are used,

• why the triplets code for the 20 amino acids (and start and stop codons) 
in the way they do, 

• why introns are expected to evolve, and how they can be used to give 
increased flexibility, 

• how optimal solutions to evolutionary problems spread in a population, 
and 

• how to analyse genetic structures. 